ABSTRACT

Modeling human decision making in strategic problem domains
is difficult with normative game theoretic approaches.
Behavioral aspects of this type of decision making, such as
forgetfulness or misattribution of reward, require additional
parameters to capture their effect on decisions. We propose
a descriptive model utilizing aspects of behavioral game
theory, machine learning, and prospect theory that replicates
the behavior of humans in uncertain strategic environments.
We test the predictive capabilities of this model over data
from 43 participants guiding a simulated Uninhabited Aerial
Vehicle (UAV) against an unknown automated opponent.

Categories  and  Subject  Descriptors 

I.2 [Artificial Intelligence]: Learning - Parameter learning

General  Terms 

Human  Factors,  Experimentation 

Keywords 

reinforcement learning, behavioral game theory, human
decision making, models

1.  INTRODUCTION 

In strategic, uncertain environments, human decision making
may not always adhere to normative decision theoretic
models. When tasked with making decisions in these domains,
humans do not always exhibit a clear memory of past
experiences. In addition, rewards from neighboring strategies
may have an impact on decisions, as humans tend to
spill over rewards from one strategy to another [9]. Essentially,
human decision making patterns include several cognitive
biases which influence their chosen strategy.

Several  behavioral  game  theory  models  exist  for  represent¬ 
ing  human  decision  making  [1,  2,  7,  8].  Many  of  these  models 
rely  upon  reinforcement  learning  and  represent  learning  as 
the  perceived  reward  of  interaction  within  an  environment. 
The  application  of  these  game  theory  models  is  limited  to 
single-shot  and  repeated  games  which  are  represented  in  nor¬ 
mal  form. 

In real-world strategic domains, the environments are largely
sequential and uncertain. Reinforcement learning is well
explored in these types of problem domains, for which the
popular reinforcement learning technique, Q-learning, has
been developed [5]. The Q-learning function determines the
optimal set of strategies to maximize the total reward by
analyzing immediate rewards and potential future rewards as
a game progresses from state to state. Current applications
of this technique apply to purely rational decision making.

This paper presents a study conducted with human subjects
to observe decision making patterns. Participants in
this study were given the task of observing an unmanned
aerial vehicle (UAV) navigate through a series of sectors (in
a 4x4 grid) and assessing the likelihood of their UAV reaching
a goal sector without being detected by an automated
enemy UAV (whose location is largely unknown). The primary
hypothesis of this experiment was that incentivizing
their assessments via proper scoring rules would
improve assessment techniques. The secondary question,
and the focus of this paper, was whether participants
were learning in this environment and, if so, how to model
the participants’ learning. While the investigation into
incentives did not yield a significant result, we observe
remarkable learning and provide an aggregate learning model.

Reinforcement learning is a convincing model for this domain.
The UAV problem, while including another agent,
can be modeled as a single-player game, where the participant
does not model the enemy. The enemy UAV is revealed
to the participant as moving in a deterministic fashion. The
participant will always lose if they follow the same trajectory
and are in the same state (after the same number of
moves) that caused a loss in a previous iteration of the game.
The enemy is therefore a part of the game’s environment and
need not be modeled explicitly by the participant.

The task of probability assessment in human decision making
is also subject to biases [6]. When a participant states
their probability assessment, it may not be equivalent to
their believed probability of success. When rewards are
non-deterministic, such as in gambling, there is much evidence
that humans, in general, underweight or overweight their
assessments at the extreme cases (near 0% or 100%) [4].
Subproportional probability weighting functions map believed
probabilities to expressed assessments; this mapping is
generally not linear, as it would be in the normative case.

While behavioral game theory, sequential reinforcement
learning, and probability assessment mapping are well
explored, combining them into a single model is a novel
approach. We establish a formal model that attributes behavioral
effects to sequential domains of uncertainty and augment
assessment with a subproportional probability weighting
function. We test its predictive capabilities over a data
set of 43 participants. Our results indicate that this
descriptive version of the Q-learning model shows significant
gains over the respective normative version, as well as other
baseline comparative models.

By  utilizing  a  behavioral  game  theoretic  model  to  predict 
human  decision  making,  we  can  gain  insight  into  the  biases 
that  humans  suffer  from  when  faced  with  strategic  uncer¬ 
tainty.  Models,  such  as  our  descriptive  Q-learning  model, 
are  able  to  illustrate  human  learning  and  predict  decisions 
that  they  make  in  strategic  domains.  Analyzing  the  pa¬ 
rameters  fit  to  these  models  measures  the  impact  that  these 
cognitive  biases  have. 

2.  EXPERIMENT: PROBABILITY ASSESSMENT FOR STRATEGIC DECISION MAKING

In  a  large  study  conducted  with  human  participants,  we 
investigate  probability  assessment  elicited  during  a  strategic, 
uncertain  decision  making  game.  We  begin  with  a  descrip¬ 
tion  of  the  game  followed  by  a  discussion  of  the  methodology 
used  to  collect  participant  assessment  data.  We  conclude 
this  section  with  a  description  of  the  results  generated  in 
this  study. 

2.1  Study:  UAV  Game 

To test the assessment techniques of human participants,
we created a strategic game of uncertainty utilizing a graphical
representation of a gameboard. In this sequential game,
participants observe a UAV (hereafter participant’s UAV)
moving through a 4 x 4 sector grid from an initial sector
towards a colored goal sector. Participants are given the initial
location of another UAV (hereafter enemy UAV), but
no other information about its movement or successive locations.
A trial (the completion of one trajectory) is considered
a “win” if the participant’s UAV reaches the goal sector,
or a “loss” if it is caught by the enemy UAV.

Fig.  1  represents  the  first  two  sectors  visited  (or  decision 
points)  of  a  trial.  The  gameboard  grants  clairvoyance  of  the 
entire  trajectory  for  the  current  trial,  the  initial  location  of 
the  enemy,  and  the  already  traveled  course. 

The goal of the experiment was to gather participants’
assessments of the overall likelihood of a trial’s success.
Given the knowledge of the initial location of the enemy, as
well as the growing knowledge of its movements based on
losses, this game exemplifies a learning task.

Figure 1: Two decision points of a given trial in
the UAV game. The participant knows the enemy
location only on the first decision point.

2.1.1  Participants 

43  participants  were  included  in  this  study.  Participants 
were  pulled  from  a  pool  of  undergraduate  students  taking 
introductory  psychology  courses  in  our  university.  Partic¬ 
ipants  were  paid  via  a  variety  of  payment  mechanisms  for 
their  time.  As  the  initial  hypothesis  of  incentivization  tech¬ 
niques  was  inconclusive,  we  included  all  participants,  regard¬ 
less  of  this  effect,  in  this  paper. 

2.1.2  Methodology 

Participants  play  20  total  trials  of  the  game.  Two  initial 
phases,  representing  the  training  phases  of  the  game,  con¬ 
sist  of  5  trials  each.  At  the  end  of  each  of  these  sets,  the 
participant  undergoes  an  intervention,  in  which  the  proctor 
of  the  experiment  highlights  participant  assessments  which 
are  too  high  or  too  low. 

At  each  decision  point,  the  participant  is  required  to  fill 
out  a  questionnaire.  In  the  questionnaire,  the  participant 
notes  the  direction  the  UAV  will  move  and  their  estimation 
for  the  probability  that  the  participant  UAV  will,  without 
being  caught,  arrive  in  the  next  sector  and  the  eventual  goal 
sector.  After  filling  out  the  questionnaire,  the  participant 
may  move  onto  the  next  slide  of  the  game. 

2.1.3  Results 

Participant data was broken up into two discrete data sets:
trials resulting in wins and those resulting in losses. We analyzed
the data for trends within the trial (as the UAV approached
the goal sector) and between trials (as participants
became more familiar with the game). We expect, as a trial
progresses, that a participant will assess higher likelihoods
of success as they approach the goal sector. Additionally,
as the game progresses, the participant should become more
confident in their assessments.


Table 1: Slope analysis of results

(a) Losses

Trend                 Estimate    P-value
intercept             0.3315      <0.0001
slope within trial    0.02053     0.0016
slope across trials   -0.00486    <0.0001

(b) Wins

Trend                 Estimate    P-value
intercept             0.5392      <0.0001
slope within trial    0.05395     <0.0001
slope across trials   -0.00129    <0.0001


Table 1 reports the results of running a generalized
linear mixed effect regression analysis over our data
with random intercept and slope at the decision point and
trial level. Our results indicate that each of the estimates
is significant.

When considering assessments as a trajectory progresses,
participants generally increase their assessments as they
approach the goal sector. The rate at which a participant’s
stated probability increases is greater for winning trajectories
than for losing ones. This is to be expected: as participants
become more familiar with the possible movement of the
enemy, they become better at predicting eventual losses.

As  participants  complete  trials,  the  slope  of  the  change  in 
elicited  probabilities  decreases  significantly.  This  decrease 
in  slope  indicates  that  participants  are  not  changing  their 
probability  assessments  as  much  as  they  were  in  previous 
trials,  representing  a  general  increase  in  confidence  of  the 
participant’s  guesses  for  both  wins  and  losses.  The  ideal 
case  is  that,  as  participants  learn  how  the  enemy  is  moving, 
their  slope  across  trials  will  approach  0. 

With  the  clear  trends  towards  generally  increasing  assess¬ 
ments  as  trials  progress  and  the  relative  growth  of  confi¬ 
dence  as  participants  complete  trials,  these  results  indicate 
a  strong  justification  for  the  application  of  a  learning  model. 

3.  DESCRIPTIVE  MODEL  FOR  REINFORCE¬ 
MENT  LEARNING 

Our  model  is  an  extension  of  the  popular  reinforcement 
learning  algorithm  known  as  Q-learning.  By  attributing  con¬ 
cepts  derived  from  behavioral  game  theory  to  Q-learning, 
we  establish  a  novel  framework  for  descriptive  reinforcement 
learning.  Additionally,  borrowing  from  concepts  in  prospect 
theory  creates  a  better  mapping  of  true  beliefs  to  expressed 
probabilities. 

3.1  Normative  Q-learning 

Q-learning is a popular machine learning model for representing
learning in sequential domains. It characterizes the
reinforcement learning problem as a conjunction of previous
information and future rewards, decayed by a discount
parameter, γ. Q-learning is an algorithm that exemplifies
exploration vs. exploitation, which prefer possible future
payoffs or previously learned payoffs, respectively [5]. This
decision is mediated by the learning parameter, α. Equation
1 shows the standard Q-learning function.

$$Q(s,a) = Q(s,a) + \alpha\left(r(s) + \gamma \max_{a'} Q(s',a') - Q(s,a)\right) \tag{1}$$

This  function  serves  as  a  powerful  mechanism  to  model 
learning  with  long-term  optimality.  However,  it  does  not 
exemplify  the  behavioral  aspects  of  human  decision  making. 
With  the  concepts  derived  from  behavioral  game  theory,  we 
can  apply  descriptive  parameters  to  the  Q-learning  function. 
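
As a minimal sketch of the update in Equation 1, the following Python snippet applies one tabular Q-learning step. The dictionary-based table, the helper name `q_update`, and the example states, actions, and parameter values are illustrative assumptions and are not taken from the study.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, next_actions, alpha, gamma):
    """Apply one step of Equation 1 to the table q (hypothetical helper)."""
    # Best value reachable from the next state; 0.0 if there are no actions.
    best_next = max((q[(next_state, a)] for a in next_actions), default=0.0)
    # Difference between the new estimate and the stored value.
    td_error = reward + gamma * best_next - q[(state, action)]
    q[(state, action)] += alpha * td_error
    return q[(state, action)]


# Illustrative values only: one known payoff in the next state.
q = defaultdict(float)
q[("s1", "east")] = 1.0
new_value = q_update(q, "s0", "east", 0.0, "s1", ["east", "north"],
                     alpha=0.5, gamma=0.9)
```

With these example values the stored estimate for the pair ("s0", "east") moves from 0 toward the discounted value of the best next action, scaled by the learning parameter.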

3.2  Behavioral  Q-learning 

The  inspiration  for  the  descriptive  model  is  derived  from 
behavioral  game  theory.  Several  game  theoreticians  [2,  9,  3] 
have  investigated  human  biases  as  associated  with  problems 
of  decision  making.  Their  investigations  are  uniquely  in  the 
context  of  single  shot  and  repeated  games. 

3.2.1  Behavioral Reinforcement Learning

Game  theory  seeks  to  analyze  and  explain  the  mechanisms 
by  which  decisions  are  made  [1].  Assuming  that  participants 
understand  the  game,  the  environment,  and  make  decisions 
in  a  purely  rational  manner,  applicable  game  theoretic  mod¬ 
els  will  be  able  to  predict  the  behavior  of  a  human.  This  is 
rarely  the  case  in  reality,  however.  Cognitive  biases  plague 
the  human  decision  making  process,  leading  to  seemingly 
subrational  decisions.  Behavioral  game  theory  models  learn¬ 
ing  with  these  biases  in  consideration. 

Several  models  exist  that  attempt  to  express  learning  within 
decision  making  domains.  The  reinforcement  learning  algo¬ 
rithm  portrays  learning  as  a  function  of  interaction  with  an 
environment  and  the  immediate  rewards.  As  an  individual 
moves  through  the  world,  it  experiences  stimuli  that  it  at¬ 
tributes  to  doing  a  particular  action.  Algorithmically,  the 
reinforcement  learning  algorithm  can  be  characterized  as: 

$$A_c(t) = A_c(t-1) + r \tag{2}$$

The  attraction  to  doing  a  strategy  c  at  time  step  t  is  the 
previous  attraction  to  doing  strategy  c  and  its  immediate 
reward.  An  attraction  may  be  implemented  in  many  ways, 
but  it  is  essentially  a  concept  representing  the  desirability 
of  taking  a  particular  action. 

Insights from behavioral game theory have provided parameters
that better explain the irrational behavior that
arises in human decision making [2]. Such concepts include
forgetfulness (the degrading effect of previous information
on future decisions) and spillover (the phenomenon
of humans attributing rewards to neighboring strategies).
Behavioral reinforcement learning can be expressed as:

$$A_c(t) = \phi A_c(t-1) + (1-\epsilon)\,r \tag{3}$$

$$A_n(t) = \phi A_n(t-1) + \epsilon\,r \tag{4}$$

φ represents the forgetfulness parameter, ε represents the
spillover parameter, and A_n represents the attraction to
strategy n, which is a neighboring strategy to c. Both
parameters are bounded between 0 and 1.
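
The attraction updates in Equations 3 and 4 can be sketched in a few lines of Python. This is only an illustration: the function name `update_attractions`, the strategy labels, and the numbers are assumptions, and leaving non-neighboring strategies untouched is a simplification the text does not specify.

```python
def update_attractions(attractions, chosen, neighbors, reward, phi, epsilon):
    """Return new attractions after one reward (hypothetical helper).

    phi is the forgetfulness parameter, epsilon the spillover parameter.
    """
    updated = dict(attractions)
    # Equation 3: the chosen strategy keeps (1 - epsilon) of the reward.
    updated[chosen] = phi * attractions[chosen] + (1.0 - epsilon) * reward
    # Equation 4: each neighboring strategy receives epsilon of the reward.
    for n in neighbors:
        updated[n] = phi * attractions[n] + epsilon * reward
    # Assumption: strategies that are neither chosen nor neighbors are unchanged.
    return updated


# Illustrative values only.
before = {"c": 1.0, "n": 0.5, "far": 0.2}
after = update_attractions(before, "c", ["n"], reward=1.0, phi=0.5, epsilon=0.25)
```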

Forgetfulness in the context of our domain would imply
that the experience from a previous trial has a diminished
effect on current experiences. Spillover generally involves the
misattribution (or “generalization”) of rewards to neighboring
strategies. An illustrative example is that of the roulette
player who places a large bet on a particular number, only
to have the ball land on a nearby number [9]. The player may
feel that their guess was confirmed, since the ball landed
near their bet, even though they lost the bet.

The implementation of the spillover parameter can be
conceptualized in a few different ways for our UAV domain.
Neighboring strategies can be viewed as nearby sectors,
directly adjacent to the sector arrived at. Since the enemy
moves in a deterministic pattern, the number of moves that
have transpired is directly related to the current location
of the enemy. With this in mind, spillover can also occur
between these time steps. Figure 2 illustrates the various
models that could represent spillover in this domain.

With Camerer et al.’s introduction of behavioral parameters
in human decision making, we now introduce our
Q-learning function as inspired by these concepts.

3.2.2  Modified  Q-learning  Function 

$$Q(s,a) = \phi Q(s,a) + \alpha\left((1-\epsilon)\,r(s) + \gamma \max_{a'} Q(s',a') - \phi Q(s,a)\right) \tag{5}$$

$$Q(s_n,a) = \phi Q(s_n,a) + \alpha\left(\epsilon\,r(s) - \phi Q(s_n,a)\right) \tag{6}$$

φ, as with its behavioral game theory counterpart, represents
the forgetfulness parameter, which decays the value
of previous information associated with that state (in our
case, waypoint sector). α mediates between exploration and
exploitation, and additionally decays future payoffs to better
value current information about the state as it approaches
1. If ε is greater than 0, the neighboring states (denoted
s_n, comprising all sectors that are 1 move away) gain a
fraction of the reward observed [2] [7].

The future payoff calculation in the Q-learning function
is of questionable application to our problem domain, however.
In essence, max_{a'} assumes that the future state-action
pairs will be the optimal choice. Participants in our problem
domain do not select the movements of the UAV.
With clairvoyance over the trajectory that the UAV will
travel, participants are likely to base their assessment on
the path revealed to them.

$$Q(s,a) = \phi Q(s,a) + \alpha\left((1-\epsilon)\,r(s) + \gamma\,Q(s',\pi(s')) - \phi Q(s,a)\right) \tag{7}$$

Equation 7 alters the future payoff function to represent a
next state payoff from the next sector, determined from the
path revealed to the participant. π(s') represents the action
determined from being in state s', which, in our case, is the
next sector in the trajectory for the given trial.
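
To make the path-following update concrete, here is a hedged Python sketch of Equation 7 together with the neighbor update of Equation 6. Because the participant does not choose actions, the table is keyed by state only; that keying, the name `behavioral_q_update`, and the example values are assumptions made for illustration, not the authors' implementation.

```python
def behavioral_q_update(q, state, next_state, reward, neighbors,
                        alpha, gamma, phi, epsilon):
    """One descriptive Q-learning step along a revealed path (sketch)."""
    # Value of the next sector on the revealed path, Q(s', pi(s')).
    next_value = q.get(next_state, 0.0) if next_state is not None else 0.0
    old = q.get(state, 0.0)
    new_q = dict(q)
    # Equation 7: forget part of the old value, keep (1 - epsilon) of the reward.
    new_q[state] = phi * old + alpha * (
        (1.0 - epsilon) * reward + gamma * next_value - phi * old)
    # Equation 6: neighbors receive the spilled-over fraction of the reward.
    for n in neighbors:
        old_n = q.get(n, 0.0)
        new_q[n] = phi * old_n + alpha * (epsilon * reward - phi * old_n)
    return new_q


# Illustrative values only: state "A" leads to "B"; "N" is a neighbor of "A".
q = {"A": 0.5, "B": 1.0, "N": 0.4}
updated = behavioral_q_update(q, "A", "B", reward=1.0, neighbors=["N"],
                              alpha=0.5, gamma=0.8, phi=0.5, epsilon=0.25)
```

Setting phi to 1 and epsilon to 0 in this sketch recovers the shape of the standard update, except that the maximum over next actions is replaced by the value of the next sector on the path.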

4.  PERFORMANCE  EVALUATION 

The data collected from the 43 participants in this
study were broken up into 5 folds, with 8-9 participants per
fold. Utilizing the Nelder-Mead method¹, parameters are
trained over 4 folds and then, to test the predictive capabilities
of the model, tested over the remaining fold. For a
baseline comparison, fits were generated for the normative
model² and compared with the descriptive model, along with
the random model³ and pathological cases⁴.

¹ The Nelder-Mead method is a downhill simplex method for
minimizing an objective function.

² The normative model does not include any descriptive
parameters: φ = 1 and ε = 0, while α is still trained.
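
The fold structure described above can be sketched as follows. The round-robin assignment, the function names, and the integer participant IDs are assumptions for illustration; the paper does not say how participants were assigned to folds, and the Nelder-Mead optimization step itself is omitted here.

```python
def make_folds(participant_ids, k=5):
    """Split participants into k folds by round-robin assignment (assumed)."""
    folds = [[] for _ in range(k)]
    for index, pid in enumerate(participant_ids):
        folds[index % k].append(pid)
    return folds


def train_test_splits(folds):
    """Yield (train, test) pairs, holding out one fold at a time."""
    for held_out in range(len(folds)):
        train = [pid for i, fold in enumerate(folds)
                 if i != held_out for pid in fold]
        yield train, folds[held_out]


# 43 hypothetical participant IDs give folds of 8 or 9 participants.
folds = make_folds(list(range(43)), k=5)
fold_sizes = sorted(len(fold) for fold in folds)
splits = list(train_test_splits(folds))
```

In each split, parameters would be fit on the training participants and the fit then evaluated on the held-out fold.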

Prior  to  calculating  the  fit  of  the  descriptive  model,  we 
must  convert  the  calculated  Q-value  generated  by  the  de¬ 
scriptive  model  to  a  probability  assessment  that  will  be  com¬ 
pared  to  the  participant  data.  Q-values  for  all  states  are 
initialized  to  0,  with  a  Q-value  of  1  being  allocated  to  the 
goal  state  and  -1  for  all  loss  states.  Q-values  approaching  -1, 
then,  represent  a  path  likely  to  lead  to  a  loss,  whereas  those 
approaching  1  indicate  a  possible  win  from  that  path.  To 
convert  these  values  to  assessments,  then,  involves  normal¬ 
izing  the  Q-value  between  0  and  1.  The  resulting  conversion 
is  then  used  as  the  Q-learning  function’s  assessment. 
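
One simple way to read the normalization step is a linear rescaling of the interval [-1, 1] onto [0, 1]. The sketch below assumes that reading; the function name `q_to_probability` and the clipping of out-of-range values are illustrative choices, since the text does not spell out the exact formula.

```python
def q_to_probability(q_value):
    """Map a Q-value in [-1, 1] to an assessment in [0, 1] (assumed linear)."""
    # Clip first so that values outside the expected range stay valid.
    clipped = max(-1.0, min(1.0, q_value))
    return (clipped + 1.0) / 2.0


loss_assessment = q_to_probability(-1.0)    # certain loss
unknown_assessment = q_to_probability(0.0)  # freshly initialized state
win_assessment = q_to_probability(1.0)      # goal state
```

Under this mapping a state that has never been updated corresponds to an assessment of 50%.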

Fits  were  generated  by  taking  the  squared  distance  be¬ 
tween  the  participant’s  stated  probability  and  the  model’s 
generated  probability  at  each  decision  point  in  the  game. 
The  model  was  subjected  to  a  simulation  of  the  game,  where 
it  was  presented  with  the  same  trajectories  and  experienced 
the  same  outcomes  as  participants.  At  each  point  where 
a  Q-value  was  updated  (following  a  simulation  of  a  leg  of 
a  trajectory),  the  distance  between  all  participants’  proba¬ 
bility  assessments  and  the  estimated  Q-value  were  squared, 
aggregated,  and  added  to  the  total  fit. 
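
The fit described above is a sum of squared distances, which can be sketched as below. The data layout (dictionaries keyed by decision point), the name `squared_error_fit`, and the sample numbers are hypothetical and serve only to show the aggregation.

```python
def squared_error_fit(model_probs, participant_probs):
    """Sum squared distances between model and stated probabilities.

    model_probs maps a decision point to the model's probability;
    participant_probs maps it to a list of participants' stated values.
    """
    total = 0.0
    for point, model_p in model_probs.items():
        for stated_p in participant_probs.get(point, []):
            total += (stated_p - model_p) ** 2
    return total


# Illustrative values only: two decision points, three stated assessments.
model_probs = {"d1": 0.5, "d2": 0.8}
participant_probs = {"d1": [0.4, 0.7], "d2": [0.9]}
fit = squared_error_fit(model_probs, participant_probs)
```

Lower totals indicate a closer match between the model and the participants.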


Table 2: Spillover fits

Fig. 2.b    Fig. 2.c    Fig. 2.d    Fig. 2.e
415.534     415.924     416.122     409.254

In generating the results, we found that the best fit
corresponds to Figure 2.e, as reported in Table 2. This indicates
that participants considered negative and positive payouts to be
irrespective of the decision point. Essentially, if a participant
were to lose in sector [1,2] in the 3rd decision point,
they would evaluate sector [1,2] negatively in the 2nd and
4th decision points as well, while also avoiding neighboring
sectors ([1,1], [1,3], and [2,2]) in those time steps.
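
The sector [1,2] example can be reproduced with a small Python sketch of location and time step spillover. The 1-based indexing and four-way adjacency follow the example in the text; the function names, the `max_step` argument, and the choice to include neighbors at the current time step are assumptions about how "full" spillover might be read.

```python
GRID_SIZE = 4  # the game board is a 4 x 4 grid of sectors


def neighboring_sectors(sector, grid_size=GRID_SIZE):
    """Sectors one move away (up, down, left, right), 1-based indices."""
    row, col = sector
    candidates = [(row - 1, col), (row + 1, col),
                  (row, col - 1), (row, col + 1)]
    return [(r, c) for r, c in candidates
            if 1 <= r <= grid_size and 1 <= c <= grid_size]


def spillover_targets(sector, step, max_step):
    """(sector, step) pairs that would receive spilled-over reward."""
    steps = [t for t in (step - 1, step, step + 1) if 1 <= t <= max_step]
    targets = set()
    for t in steps:
        for s in [sector] + neighboring_sectors(sector):
            # The visited sector at the current step gets the main update.
            if (s, t) != (sector, step):
                targets.add((s, t))
    return targets


neighbors = sorted(neighboring_sectors((1, 2)))
targets = spillover_targets((1, 2), step=3, max_step=8)
```

For a loss in sector [1,2] at the 3rd decision point, the targets include the same sector at the 2nd and 4th decision points and the three neighboring sectors at the adjacent time steps.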

Table 3: Descriptive model: parameters and fits

α        φ        ε        Fit
0.819    0.591    0.537    409.254

Table 3 reports the results from optimizing our Q-learning
function utilizing the Nelder-Mead method. Our comparative
analysis between models is described in Table 4. The
descriptive model outperformed the normative model with
p < 0.01. Additionally, the descriptive model had a better
test fit than the random and pathological models.

4.1  Improving  Model  Predictions 

Although our results are significant, improvements can be
made to the predictive capabilities of our model. Humans
not only exhibit cognitive biases in generating their
probabilities, but they additionally misrepresent those probabilities
[6]. By including a theoretically sound probability weighting
function, we improve our descriptive model by replicating
this behavior.

³ The probability estimations are completely random for
each decision point within a trial.

⁴ Pathological cases include the categorical optimist and
pessimist (who always guess 100% and 0%, respectively).

Figure 2: (a) No spillover, (b) local spillover, (c) time step spillover, (d) fractional time step and location
spillover, (e) full time step and location spillover.

Table 4: Fit comparison for all models

Descriptive    Normative    Random     Optimist    Pessimist
409.254        416.409      891.182    1052.723    1971.333

4.1.1  Probability  Weighting 

Prospect theory notes that the weights given to probability
assertions and the associated payoff values are usually
not linear. That is, humans tend to under- or over-weight
probability assessments in domains of chance. In our domain,
participants are queried for their assessment of the
overall success of their current trial as it progresses, which
is subject to non-linear assessment mappings. To this end,
we included a subproportional function in the mapping of
Q-values to probability assessments [6].

$$w(p) = \exp\left(-(-\ln p)^{\beta}\right) \tag{8}$$

Equation 8 defines the subproportional function for a given
probability, p. When the exponent β is between 0 and 1, the
curve is inverse sigmoidal. This indicates that probabilities
are overweighted when low and underweighted when
high. Conversely, if β is above 1, the curve becomes sigmoidal.
At β = 1, the curve is linear, which is the normative case. Figure
3 illustrates the curves generated from example values.
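
The behavior of Equation 8 for different exponents can be checked numerically with a short Python sketch. The function name `prelec_weight` and the explicit handling of the endpoints p = 0 and p = 1 are illustrative assumptions; the exponent value 0.56 is one of the example values shown in Figure 3.

```python
import math


def prelec_weight(p, beta):
    """Subproportional weighting w(p) = exp(-(-ln p)^beta) from Equation 8."""
    # Handle the endpoints directly, since ln(0) is undefined.
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    return math.exp(-((-math.log(p)) ** beta))


linear_case = prelec_weight(0.3, 1.0)    # beta = 1 leaves p unchanged
low_weighted = prelec_weight(0.1, 0.56)  # low probability, beta below 1
high_weighted = prelec_weight(0.9, 0.56) # high probability, beta below 1
```

With an exponent below 1, the low probability is weighted above its actual value and the high probability below it, matching the inverse sigmoidal shape described above.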

4.1.2  Results 

We ran the same simulation from the original descriptive
model on the probability weighting descriptive model. As in
the original model, we also compared the augmented model
with the 43 participants from the UAV study, aggregating
fits by squaring the distance from the probability weighting
descriptive model’s Q-values to the participants’ probability
assessments.

Table 5: Descriptive model: parameters and fits

α        φ        ε        β        Fit
0.677    0.378    0.273    0.573    401.36


Including Prelec’s probability weighting function improved
the performance of the descriptive model. Table 5 describes
the averages for the parameters across folds and the fit
generated by the model. Both α and φ decreased as a result of
the inclusion.

Table 6: Comparative Fits

Descriptive (Weighted)    Descriptive (Unweighted)
401.36                    409.254

Table 6 shows a side-by-side comparison of the descriptive
model’s fit both with and without the probability weighting
function. A two-tailed t-test of the distance between
each model version’s generated probabilities resulted in a
significant p-value of less than 0.01. Since the weighted model
is a significant improvement over the unweighted model, it
is, transitively, an improvement over the normative model
as well.

5.  ANALYSIS 

5.1  Parameters 

Analysis of the test fits for the descriptive model illuminated
some behaviors of human participants in sequential
strategic games. The first observation we made is that the
higher value for β is representative of a decision making
pattern that may be characteristic of win-or-lose strategic
games. Traditionally, in betting games, participants tend
to avoid extreme estimations [6]. However, in the unknown
environment of our particular domain, a cursory glance at
the raw data indicates a predilection towards extreme
probability assessments, which our model corroborates.

The results also indicate a higher preference for exploitation
of knowledge in our domain. φ values converged, on
average, near 0.5, with slightly higher α values. A φ value
towards 0.38 would indicate that participants’ previous
knowledge is deteriorating at a rate of about a third of the reward
from the last time the state was visited. An α tuned around
0.677 would indicate a higher rate of exploration as participants
move through the game. That is, participants are
valuing new information at 68% of its actual reward.

The observation of the ε parameter bears discussion, as
well. A spillover rate of 27% is relatively high in comparison
to other implementations of this parameter in reinforcement
learning [2]. This would indicate that participants were
attributing around a quarter of the received reward for a sector
to its neighboring sectors.

5.2  Projected  probabilities 

As  with  our  cursory  analysis  of  the  data  received  from  par¬ 
ticipants,  plots  of  the  models’  probability  estimations  were 
categorized  by  wins  and  losses  when  compared  with  the  es¬ 
timates  made  by  participants. 

Figure 4 plots the average probabilities for trials generated
from the various models (descriptive with weighting, descriptive
without weighting, and the normative model) and the
data. Figure 4.a shows a relatively similar curve between
the models and the data, with the descriptive model with
weighting being the closest in overall distance. For Figure
4.b, the shape is also similar to the data, but the descriptive
model with weighting is no longer the closest. As we’ll see
with later plot analysis, the models are less accurate on the
trials that result in a loss, indicative of a different type of
learning and probability assessment in those cases.

Figure 3: (a) β = 0.56, (b) β = 1 (linear), (c) β = 1.6

Figure 4: Trial averages. (a) Wins, (b) Losses.

Figure 5 shows the plots for the averages of probability
assessments for individual decision points made by participants
and generated by the models for trials that resulted
in a win. Trials that result in wins can be categorized into 3
different trajectory lengths. If the participant’s UAV eventually
reaches the goal sector, it will do so in 4, 6, or 8 moves.
Figure 5.a shows the overall plot for averages of decision
points regardless of the trial type. While the overall fit for
the descriptive model with weighting is the closest, the plot
has an irregular shape. This is due to the differing number of
data points for trials of different lengths (e.g. there are only
3 trials of length 8, but there are 11 total trials that result
in a win) and the different types of behavior in the various
trial lengths. Figures 5.b, 5.c, and 5.d show the underlying
behavior for trials of each length, with the descriptive model
with weighting outperforming the other models in each case.

Figure 6 shows the plots for averages of probability
assessments for the data and models over trials that consist of
losses. These trials break down into 3 and 5 point trials and
are categorized accordingly. As with the plot for the loss
trial averages, the models tend to perform worse on decision
point averages for loss trials. Participants, on average, start
with much lower assessments than with trials that result in
a win. This indicates that participants are better at
identifying eventual losses and retain their pessimism as trials
progress. The models, on the other hand, become progressively
more pessimistic. The data for the 5 point trial is
completely flat as there is only one trial that is 5 points in
length (that results in a loss) and the model is not able to
acquire enough information to give an accurate assessment.

Figure 5: Decision point averages (wins). (c) 6 points, (d) 8 points.

5.3  Discussion 

The  results  of  the  fitting  of  this  model  are  illuminating. 
They  are  indicative  of  the  relative  power  of  behavioral  game 
theoretic  parameters  in  a  sequential  learning  model.  The 
addition  of  a  probability  weighting  curve  further  improved 
our  results. 

Though the analysis on reinforcement learning in this domain
indicates a significant gain from the inclusion of behavioral
parameters, other competing learning models can
be compared as a baseline for the effectiveness of reinforcement
learning in this domain. Several behavioral approaches
to belief-based learning may be applicable to the sequential
strategic game utilized in this paper. Camerer et al. have
proposed alternative models to reinforcement learning in
behavioral game theory that merit further investigation.

Figure 6: Decision point averages (losses). (a) All points, (b) 3 points, (c) 5 points.


6.  RELATED  WORK 

Several other models exist that seek to express descriptive
learning in human decision making domains. Besides
reinforcement learning, belief learning, experience-weighted
attraction learning, imitation, and direction learning also
represent other approaches to behavioral game theory [1].
Belief learning represents learning as a process of basing future
considerations on observed behavior in the last round
[3]. In our domain, it is possible for participants to consider
their rewards as dependent on the movement of the enemy,
but, considering the lack of information associated with the
enemy, it is likely that their wins and losses are modeled as
an aspect of the environment.

Erev and Roth also investigate descriptive reinforcement
learning, but it is examined in repeated stage games, not the
sequential domain [8]. Many of the applications of our model
are present in their work, but the concept of uncertainty
and generalizations of strategy are not implemented in their
analysis.

Our work extends observations from Camerer et al.’s
Experience-Weighted Attraction model, though that model has
shortcomings similar to those of the Erev and Roth model [2].
It is specific to stage games, as opposed to the sequential
environment of the UAV and other strategic problems.
Additionally, the applicability of the law of simulated effect⁵ is
less pronounced in our model, as the payouts for foregone
strategies are unknown.